
fix: batch pending parent root fetches to avoid 300+ sequential round-trips#694

Closed
ch4r10t33r wants to merge 7 commits into `main` from `fix/checkpoint-sync-chunked-body`

Conversation

@ch4r10t33r
Contributor

Summary

  • Adds a `pending_parent_roots` queue to `BeamNode`
  • `cacheBlockAndFetchParent` now enqueues the missing parent root instead of immediately firing an individual `blocks_by_root` request
  • New `flushPendingParentFetches` drains the map and sends one batched request for all accumulated roots
  • The flush is called at every natural processing exit point (the missing-parent early return, `handleGossipProcessingResult`, and `processBlockByRootChunk`)

Problem

When a syncing peer walks a long parent chain (e.g. 300 blocks), every missing parent triggered its own `blocks_by_root` request on a fresh libp2p stream. This flooded both sides with 300+ individual round-trips instead of a single batched request.

Effect

When multiple gossip blocks arrive in the same burst sharing the same missing ancestor, all of their parent roots are collected into one map and sent as a single request. In the sequential case (each block reveals the next missing parent), the map also deduplicates repeat fetches of the same root.
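The queue-and-flush shape described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `pending_parent_roots`, `cacheBlockAndFetchParent`, `flushPendingParentFetches`, and `fetchBlockByRoots` names come from the PR text, while the `Root` type, the depth value type, and all surrounding plumbing are assumptions.

```zig
const std = @import("std");

const Root = [32]u8;

/// Minimal sketch of the deferred parent-fetch queue; everything not
/// named in the PR summary is illustrative.
const BeamNode = struct {
    allocator: std.mem.Allocator,
    /// Missing parent roots accumulated during one processing pass,
    /// keyed by root so the same ancestor is only requested once.
    pending_parent_roots: std.AutoHashMap(Root, usize),

    fn cacheBlockAndFetchParent(self: *BeamNode, parent_root: Root, depth: usize) !void {
        // Enqueue instead of firing an individual blocks_by_root request.
        try self.pending_parent_roots.put(parent_root, depth);
    }

    fn flushPendingParentFetches(self: *BeamNode) !void {
        if (self.pending_parent_roots.count() == 0) return;
        const roots = try self.allocator.alloc(Root, self.pending_parent_roots.count());
        defer self.allocator.free(roots);
        var i: usize = 0;
        var it = self.pending_parent_roots.keyIterator();
        while (it.next()) |root| : (i += 1) roots[i] = root.*;
        self.pending_parent_roots.clearRetainingCapacity();
        // One batched blocks_by_root request for every accumulated root.
        try self.fetchBlockByRoots(roots);
    }

    // Stands in for the real network request; elided in this sketch.
    fn fetchBlockByRoots(self: *BeamNode, roots: []const Root) !void {
        _ = self;
        _ = roots;
    }
};
```

Because the hash map is keyed by root, a burst of N blocks sharing one missing ancestor collapses to a single entry before the flush, which is exactly the dedup behaviour the Effect section describes.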

Test plan

  • `zig build test --summary all` passes

ch4r10t33r and others added 6 commits March 23, 2026 15:25
When the SSZ state grows beyond ~3 MB the server switches from sending
a Content-Length response to Transfer-Encoding: chunked. The previous
body-reading loop called readSliceShort which internally goes through:

  readSliceShort → readVec → defaultReadVec → contentLengthStream

contentLengthStream accesses reader.state.body_remaining_content_length
but that field is not active for chunked responses (state is 'ready'),
causing a panic:

  thread 1 panic: access of union field 'body_remaining_content_length'
  while field 'ready' is active

Replace the manual request/response loop with client.fetch() using a
std.Io.Writer.Allocating as the response_writer. fetch() calls
response.readerDecompressing() + streamRemaining() which dispatches
through chunkedStream or contentLengthStream correctly based on the
actual transfer encoding used by the server.
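The replacement described above can be sketched with the Zig 0.15-era `std.http` API. The URL, allocator wiring, and status handling here are illustrative assumptions; only the `fetch()` + `std.Io.Writer.Allocating` combination comes from the commit message.

```zig
const std = @import("std");

/// Sketch: download a body of unknown transfer encoding into memory.
/// fetch() dispatches through chunkedStream or contentLengthStream based
/// on what the server actually sent, avoiding the readSliceShort panic.
fn downloadState(allocator: std.mem.Allocator, url: []const u8) ![]u8 {
    var client: std.http.Client = .{ .allocator = allocator };
    defer client.deinit();

    // Accumulate the response body into an allocator-backed writer.
    var body: std.Io.Writer.Allocating = .init(allocator);
    errdefer body.deinit();

    const result = try client.fetch(.{
        .location = .{ .url = url },
        .response_writer = &body.writer,
    });
    if (result.status != .ok) return error.UnexpectedStatus;

    return body.toOwnedSlice();
}
```

The key point is that no code here touches `body_remaining_content_length` directly; the decoding path is chosen inside `fetch()` from the response headers.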
After checkpoint sync, incoming blocks may carry attestations whose target
slots predate the finalized anchor. IsJustifiableSlot correctly identifies
these as non-justifiable, but the `try` was propagating the error fatally,
causing the entire block import to fail.

This creates a cascading gap: block N fails → blocks N+1..M fail (missing
parent) → no epoch-boundary attestations accumulate → justified checkpoint
never advances → forkchoice stays stuck in `initing` indefinitely.

Fix: catch InvalidJustifiableSlot and treat it as `false`. The attestation
is then silently skipped via the existing !is_target_justifiable check,
exactly as all other non-viable attestations (unknown source/target/head,
stale slot, etc.) are handled. The block imports successfully, the chain
catches up, and the node exits the initing state.

Update the test that was asserting the old (buggy) error-propagation
behaviour to instead assert that process_attestations succeeds.
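The catch-and-downgrade described above can be sketched as below. The `IsJustifiableSlot` and `InvalidJustifiableSlot` names come from the commit message; the argument names and the surrounding attestation loop are assumptions.

```zig
// Inside the per-attestation processing loop (context assumed):
const is_target_justifiable = IsJustifiableSlot(anchor_slot, target_slot) catch |err| switch (err) {
    // Pre-anchor target slots are simply "not justifiable", not fatal.
    error.InvalidJustifiableSlot => false,
    else => return err,
};
if (!is_target_justifiable) {
    // Skipped exactly like other non-viable attestations
    // (unknown source/target/head, stale slot, ...).
    continue;
}
```

Only this one error value is downgraded; any other failure still propagates, so genuine bugs are not silently swallowed.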
After checkpoint sync the forkchoice starts in the initing state and
waits for a first justified checkpoint before declaring itself ready.
The status-response sync handler was checking getSyncStatus() and
treating fc_initing the same as synced — doing nothing.  This created
a deadlock: the node never requested blocks from ahead peers because
it was in fc_initing, and it could never leave fc_initing because no
blocks were imported.

Fix the deadlock in two places:

1. Status-response handler: add an explicit fc_initing branch that
   requests the peer's head block when the peer is ahead of our anchor
   slot.  This mirrors the behind_peers branch but uses head_slot for
   the comparison (finalized_slot is not yet meaningful in fc_initing).

2. Periodic sync refresh: every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8)
   slots, re-send our status to all connected peers when not synced.
   This recovers from the case where all peers were already connected
   before the fix was deployed, so no new connection event fires and
   the status-response handler would never be re-triggered.
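The first half of the fix can be sketched as a new branch in the status-response handler's switch. `fc_initing`, `behind_peers`, `head_slot`, and `getSyncStatus` come from the commit message; the peer-status fields and `fetchBlockByRoots` call shown here are assumptions about the surrounding code.

```zig
// Sketch of the status-response handler after the fix (context assumed):
switch (self.getSyncStatus()) {
    .fc_initing => {
        // finalized_slot is not yet meaningful in fc_initing, so compare
        // head slots: if the peer is ahead of our checkpoint anchor,
        // request its head block so imports can start and forkchoice can
        // eventually leave fc_initing.
        if (peer_status.head_slot > self.anchor_slot) {
            try self.fetchBlockByRoots(&.{peer_status.head_root});
        }
    },
    .behind_peers => {
        // pre-existing branch, unchanged: compares finalized slots
    },
    .synced => {},
}
```

The periodic refresh (point 2) then guarantees this handler keeps firing every `SYNC_STATUS_REFRESH_INTERVAL_SLOTS` slots while unsynced, even when no new connection event arrives.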
When a block arrives with a missing parent, the old code immediately sent
an individual blocks_by_root request for that single parent root.  A syncing
peer walking a long parent chain (e.g. 300 slots back) would therefore open
300+ separate libp2p streams - one per ancestor - flooding both sides with
individual round-trips.

Replace the immediate fire-and-forget with a deferred queue:

- Add `pending_parent_roots: AutoHashMap(Root, depth)` to BeamNode.
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of
  calling `fetchBlockByRoots` directly.
- `flushPendingParentFetches` drains the map and issues a single batched
  blocks_by_root request for all accumulated roots.
- The flush is called at every natural exit point: after the missing-parent
  early-return in `onGossip`, at the end of `handleGossipProcessingResult`,
  and at the end of `processBlockByRootChunk`.

When multiple gossip blocks arrive in the same burst with the same missing
ancestor, all their parent roots are now collected and sent as one request
instead of N separate requests.
@ch4r10t33r ch4r10t33r changed the title node: batch pending parent root fetches to avoid 300+ sequential round-trips fix: batch pending parent root fetches to avoid 300+ sequential round-trips Mar 25, 2026
@ch4r10t33r
Contributor Author

Closing in favour of a clean PR with only the relevant commits.

@ch4r10t33r ch4r10t33r closed this Mar 25, 2026